AITopics | data subset selection

ORIENT: SubmodularMutualInformationMeasures forDataSubsetSelectionunderDistributionShift

Neural Information Processing SystemsFeb-11-2026, 23:56:01 GMT

The recent success of deep learning frameworks in applications such as image classification [9], speech recognition [20], and object detection [13] stems primarily from the availability of large amounts of labeled data.

artificial intelligence, machine learning, orient, (19 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

SIMILAR: SubmodularInformationMeasuresBased ActiveLearningInRealisticScenarios

Neural Information Processing SystemsFeb-10-2026, 04:27:38 GMT

Active learning has proven to be useful for minimizing labeling costs by selecting the most informative samples.

artificial intelligence, learning, machine learning, (18 more...)

Neural Information Processing Systems

Country:

North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
Asia > Middle East > Jordan (0.04)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)

Add feedback

ORIENT: Submodular Mutual Information Measures for Data Subset Selection under Distribution Shift

Neural Information Processing SystemsDec-25-2025, 07:45:01 GMT

Real-world machine-learning applications require robust models that generalize well to distribution shift settings, which is typical in real-world situations. Domain adaptation techniques aim to address this issue of distribution shift by minimizing the disparities between domains to ensure that the model trained on the source domain performs well on the target domain. Nevertheless, the existing domain adaptation methods are computationally very expensive. In this work, we aim to improve the efficiency of existing supervised domain adaptation (SDA) methods by using a subset of source data that is similar to target data for faster model training. Specifically, we propose ORIENT, a subset selection framework that uses the submodular mutual information (SMI) functions to select a source data subset similar to the target data for faster training. Additionally, we demonstrate how existing robust subset selection strategies, such as GLISTER, GRADMATCH, and CRAIG, when used with a held-out query set, fit within our proposed framework and demonstrate the connections with them. Finally, we empirically demonstrate that SDA approaches like d-SNE, CCSA, and standard Cross-entropy training, when employed together with ORIENT, achieve a) faster training and b) better performance on the target data.

data subset selection, orient, submodular mutual information measure, (4 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.40)

Add feedback

AUTOMATA: Gradient Based Data Subset Selection for Compute-Efficient Hyper-parameter Tuning

Neural Information Processing SystemsDec-25-2025, 02:30:43 GMT

Deep neural networks have seen great success in recent years; however, training a deep model is often challenging as its performance heavily depends on the hyper-parameters used. In addition, finding the optimal hyper-parameter configuration, even with state-of-the-art (SOTA) hyper-parameter optimization (HPO) algorithms, can be time-consuming, requiring multiple training runs over the entire datasetfor different possible sets of hyper-parameters. Our central insight is that using an informative subset of the dataset for model training runs involved in hyper-parameter optimization, allows us to find the optimal hyper-parameter configuration significantly faster. In this work, we propose AUTOMATA, a gradient-based subset selection framework for hyper-parameter tuning. We empirically evaluate the effectiveness of AUTOMATA in hyper-parameter tuning through several experiments on real-world datasets in the text, vision, and tabular domains. Our experiments show that using gradient-based data subsets for hyper-parameter tuning achieves significantly faster turnaround times and speedups of 3 -30 while achieving comparable performance to the hyper-parameters found using the entire dataset.

automata, compute-efficient hyper-parameter tuning, data subset selection, (6 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.60)

Add feedback

Market-Driven Subset Selection for Budgeted Training

Jha, Ashish, Leplat, Valentin, Phan, AH

arXiv.org Artificial IntelligenceOct-21-2025

Training large language models on massive datasets is computationally expensive, yet empirical evidence suggests that substantial portions of training examples contribute minimally to final performance. Data subset selection addresses this inefficiency by identifying small, high-utility subsets under resource constraints. However, example utility is inherently multi-faceted, encompassing uncertainty, distributional rarity, and diversity signals that are heterogeneous and typically combined through ad hoc weighted sums lacking theoretical grounding. We propose a market-based framework that treats each training example as a tradeable contract and employs the Logarithmic Market Scoring Rule to aggregate multiple utility signals into coherent prices. Heterogeneous signals act as traders, a single liquidity parameter controls concentration versus smoothing, and topic-wise normalization ensures calibrated aggregation. Token budgets are handled explicitly through a price-per-token decision rule with an interpretable length-bias parameter. We establish theoretical connections to maximum-entropy aggregation and provide utility recovery guarantees under noisy but monotone signals. On GSM8K mathematical reasoning under strict 60k-token budgets, our selector achieves parity with strong single-signal baselines while exhibiting lower variance and incurring less than 0.1 GPU-hour overhead. On AGNews classification at 5-25\% retention rates, the market formulation delivers competitive accuracy with improved stability. Our framework unifies multi-signal data curation under fixed computational budgets for prompt-level reasoning and classification tasks.

aggregation, artificial intelligence, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2510.02456

Country: Europe > Russia (0.46)

Genre:

Overview (0.67)
Research Report (0.64)

Industry:

Leisure & Entertainment (0.49)
Banking & Finance > Trading (0.47)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (1.00)

Add feedback

SAGE: Streaming Agreement-Driven Gradient Sketches for Representative Subset Selection

Jha, Ashish, Ahmadi-Asl, Salman

arXiv.org Artificial IntelligenceOct-10-2025

Training modern neural networks on large datasets is computationally and energy intensive. We present SAGE, a streaming data-subset selection method that maintains a compact Frequent Directions (FD) sketch of gradient geometry in $O(\ell D)$ memory and prioritizes examples whose sketched gradients align with a consensus direction. The approach eliminates $N \times N$ pairwise similarities and explicit $N \times \ell$ gradient stores, yielding a simple two-pass, GPU-friendly pipeline. Leveraging FD's deterministic approximation guarantees, we analyze how agreement scoring preserves gradient energy within the principal sketched subspace. Across multiple benchmarks, SAGE trains with small kept-rate budgets while retaining competitive accuracy relative to full-data training and recent subset-selection baselines, and reduces end-to-end compute and peak memory. Overall, SAGE offers a practical, constant-memory alternative that complements pruning and model compression for efficient training.

artificial intelligence, machine learning, selection, (13 more...)

arXiv.org Artificial Intelligence

2510.0247

Country: North America > United States > Wisconsin (0.14)

Genre: Research Report (0.83)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.91)

Add feedback

Submodular Mutual Information Measures for Data Subset Selection under Distribution Shift

Neural Information Processing SystemsAug-19-2025, 00:44:26 GMT

Nevertheless, the existing domain adaptation methods are computationally very expensive.

artificial intelligence, domain adaptation, machine learning, (17 more...)

Neural Information Processing Systems

Country: North America > United States > Texas (0.05)

Genre: Research Report (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

Add feedback

ORIENT: Submodular Mutual Information Measures for Data Subset Selection under Distribution Shift

Neural Information Processing SystemsJan-18-2025, 22:11:45 GMT

Real-world machine-learning applications require robust models that generalize well to distribution shift settings, which is typical in real-world situations. Domain adaptation techniques aim to address this issue of distribution shift by minimizing the disparities between domains to ensure that the model trained on the source domain performs well on the target domain. Nevertheless, the existing domain adaptation methods are computationally very expensive. In this work, we aim to improve the efficiency of existing supervised domain adaptation (SDA) methods by using a subset of source data that is similar to target data for faster model training. Specifically, we propose ORIENT, a subset selection framework that uses the submodular mutual information (SMI) functions to select a source data subset similar to the target data for faster training. Additionally, we demonstrate how existing robust subset selection strategies, such as GLISTER, GRADMATCH, and CRAIG, when used with a held-out query set, fit within our proposed framework and demonstrate the connections with them.

data subset selection, orient, submodular mutual information measure, (2 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.43)

Add feedback

AUTOMATA: Gradient Based Data Subset Selection for Compute-Efficient Hyper-parameter Tuning

Neural Information Processing SystemsJan-18-2025, 16:10:11 GMT

Deep neural networks have seen great success in recent years; however, training a deep model is often challenging as its performance heavily depends on the hyper-parameters used. In addition, finding the optimal hyper-parameter configuration, even with state-of-the-art (SOTA) hyper-parameter optimization (HPO) algorithms, can be time-consuming, requiring multiple training runs over the entire datasetfor different possible sets of hyper-parameters. Our central insight is that using an informative subset of the dataset for model training runs involved in hyper-parameter optimization, allows us to find the optimal hyper-parameter configuration significantly faster. In this work, we propose AUTOMATA, a gradient-based subset selection framework for hyper-parameter tuning. We empirically evaluate the effectiveness of AUTOMATA in hyper-parameter tuning through several experiments on real-world datasets in the text, vision, and tabular domains.

automata, compute-efficient hyper-parameter tuning, data subset selection, (4 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.63)

Add feedback

INGENIOUS: Using Informative Data Subsets for Efficient Pre-Training of Language Models

Renduchintala, H S V N S Kowndinya, Killamsetty, Krishnateja, Bhatia, Sumit, Aggarwal, Milan, Ramakrishnan, Ganesh, Iyer, Rishabh, Krishnamurthy, Balaji

arXiv.org Artificial IntelligenceOct-19-2023

A salient characteristic of pre-trained language models (PTLMs) is a remarkable improvement in their generalization capability and emergence of new capabilities with increasing model capacity and pre-training dataset size. Consequently, we are witnessing the development of enormous models pushing the state-of-the-art. It is, however, imperative to realize that this inevitably leads to prohibitively long training times, extortionate computing costs, and a detrimental environmental impact. Significant efforts are underway to make PTLM training more efficient through innovations in model architectures, training pipelines, and loss function design, with scant attention being paid to optimizing the utility of training data. The key question that we ask is whether it is possible to train PTLMs by employing only highly informative subsets of the training data while maintaining downstream performance? Building upon the recent progress in informative data subset selection, we show how we can employ submodular optimization to select highly representative subsets of the training corpora and demonstrate that the proposed framework can be applied to efficiently train multiple PTLMs (BERT, BioBERT, GPT-2) using only a fraction of data. Further, we perform a rigorous empirical evaluation to show that the resulting models achieve up to $\sim99\%$ of the performance of the fully-trained models. We made our framework publicly available at https://github.com/Efficient-AI/ingenious.

ngenious, selection, subset, (16 more...)

arXiv.org Artificial Intelligence

2305.06677

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Asia > China > Hong Kong (0.04)
Europe > Russia (0.04)
(8 more...)

Genre: Research Report > New Finding (0.46)

Industry: Government (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Filters

Collaborating Authors

data subset selection

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

ORIENT: SubmodularMutualInformationMeasures forDataSubsetSelectionunderDistributionShift

SIMILAR: SubmodularInformationMeasuresBased ActiveLearningInRealisticScenarios

ORIENT: Submodular Mutual Information Measures for Data Subset Selection under Distribution Shift

AUTOMATA: Gradient Based Data Subset Selection for Compute-Efficient Hyper-parameter Tuning

Market-Driven Subset Selection for Budgeted Training

SAGE: Streaming Agreement-Driven Gradient Sketches for Representative Subset Selection

Submodular Mutual Information Measures for Data Subset Selection under Distribution Shift

ORIENT: Submodular Mutual Information Measures for Data Subset Selection under Distribution Shift

AUTOMATA: Gradient Based Data Subset Selection for Compute-Efficient Hyper-parameter Tuning

INGENIOUS: Using Informative Data Subsets for Efficient Pre-Training of Language Models